Abstract



We observe an N x M matrix of independent, identically distributed Gaussian random
variables which are centered except for elements of some submatrix of size n x m
where the mean is larger than some a > 0. The submatrix is sparse in the sense that
n/N and m/M tend to 0, whereas n, m, N and M tend to infinity.

We consider the problem of selecting the random variables with significantly large 
mean values. We give sufficient conditions on a as a function of n, m, N and M and 
construct a uniformly consistent procedure in order to do sharp variable selection. We 
also prove the minimax lower bounds under necessary conditions which are comple- 
mentary to the previous conditions. The critical values a* separating the necessary 
and sufficient conditions are sharp (we show exact constants). 

We note a gap between the critical values a* for selection of variables and that of 
detecting that such a submatrix exists given by [7]. When a* is in this gap, consistent 
detection is possible but no consistent selector of the corresponding variables can be 
found. 

Keywords: estimation, minimax testing, random matrices, selection of sparse signal, 
sharp selection bounds, variable selection. 
1 Introduction



We observe random variables that form an $N \times M$ matrix $Y = \{Y_{ij}\}_{i=1,\dots,N,\ j=1,\dots,M}$:

\[ Y_{ij} = s_{ij} + \xi_{ij}, \quad i = 1,\dots,N, \ j = 1,\dots,M, \qquad (1.1) \]

where $\{\xi_{ij}\}$ are i.i.d. random variables and $s_{ij} \in \mathbb{R}$, for all $i \in \{1,\dots,N\}$, $j \in \{1,\dots,M\}$.
The error terms $\xi_{ij}$ are assumed to be distributed as standard Gaussian $\mathcal{N}(0,1)$ random
variables.

Let us denote by 

\[ \mathcal{C}_{nm} = \{ C = A \times B \subset \{1,\dots,N\} \times \{1,\dots,M\},\ \mathrm{Card}(A) = n,\ \mathrm{Card}(B) = m \}, \qquad (1.2) \]

the collection of subsets of n rows and m columns of a matrix of size N x M. 

We assume that our data have mean $s_{ij} = 0$ except for elements in a submatrix of size
n x m, indexed by a set $C_0$ in $\mathcal{C}_{nm}$, where $s_{ij} \ge a$, for some a > 0.

Our model means that, for some a > 0 which may depend on n, m, N and M,

\[ \exists\, C_0 \in \mathcal{C}_{nm} \text{ such that } s_{ij} = 0 \text{ if } (i,j) \notin C_0, \text{ and } s_{ij} \ge a \text{ if } (i,j) \in C_0. \qquad (1.3) \]

Let $\mathcal{S}_{nm,a}$ be the collection of all matrices $S = S_C$, $C \in \mathcal{C}_{nm}$, that satisfy (1.3). Our
model implies also that there exists some $C_0$ in $\mathcal{C}_{nm}$ such that $S = S_{C_0}$ belongs to $\mathcal{S}_{nm,a}$.

We discuss here only significantly positive means of our random variables. The problem 
of selecting the variables with significantly negative means can be treated in the same way, 
by replacing variables with —Yij. 

Denote by $P_C$ the probability measure that corresponds to observations (1.1) with
matrix $S = S_C = \{s_{ij}\}_{i=1,\dots,N,\ j=1,\dots,M}$, $s_{ij} = 0$ if $(i,j) \notin C$, $s_{ij} \ge a > 0$ if $(i,j) \in C$. We
also denote Po = Pc an d the expected value with respect to the measure Pq. 

Our goal is to propose a consistent estimator of $C_0$, that is to select the variables in the
large matrix of size N x M where the mean values are significantly positive. Our approach 
is to find the boundary values of a > 0, as function of n, m, N and M, where consistent 
selection is possible and separate them from the cases where consistent selection is not 
possible anymore. 

We are interested here in sparse matrices, i.e. the case when n is much smaller than
N and m is much smaller than M.

Large data sets of random variables appear nowadays in many applied fields such as
signal processing, biology and, in particular, genomics, finance, etc. In genomic studies
of cancer we may need to detect sample-variable associations, see [17]. Our problem
further addresses the question: if such an association is detected, can we estimate the
sample components and the particular variables involved in this association?
We may also view our problem as a matrix-mixture model, where each observation $Y_{ij}$
has distribution

\[ Y_{ij} \sim (1-p)\cdot\mathcal{N}(0,1) + p\cdot\mathcal{N}(s_{ij},1), \]

with $p = p_{n,N,m,M} \in (0,1)$ the mixture probability (small) and $s_{ij} \ge a$ for $(i,j) \in C_0$. Such
models appear, for example, in the multiple testing setup where the $Y_{ij}$ are test statistics which
are i.i.d. under the null hypothesis and have a Gaussian distribution. Benjamini
and Hochberg [3] proposed to study the false discovery rate, and many models have been
proposed since for estimating p and the mixture density of the observations in the
non-Gaussian case. In our approach the multiple tests are indexed by $(i,j) \in \{1,\dots,N\} \times \{1,\dots,M\}$
such that the mixture occurs with a submatrix structure. We address here the
question of selecting the multiple tests which are significant (have rejected the null) in a
matrix setup and, as a particular case, in a vector setup as well. This problem is also
known as classification and it was known that in some cases classification is not possible 
even though detection is possible, see [Sj. Our result provides new rates for the matrix 
case and sharp constants for the vector case. 

Sparsity assumptions were introduced for vectors. There is a huge amount of literature
on variable selection in (sparse or not sparse) linear and nonparametric regression, Gaussian
white noise and density models. Estimation of the sparse vector as well as hypothesis
testing for vectors were thoroughly studied under various sparsity assumptions as well. See
for example Bickel, Ritov and Tsybakov [6] and references therein, for estimation issues,
and Donoho and Jin [10], Ingster [12] and Ingster and Suslina [2], for testing.

In the context of matrices, different sparsity assumptions can be imagined. For example,
matrix completion for low rank matrices with the nuclear norm penalization has been
studied by Koltchinskii, Lounici and Tsybakov [15].

The detection problem was considered in this setting by Butucea and Ingster [7]. A
more general setup, where each observation is replaced by a smooth signal was considered 
by Butucea and Gayraud [8]. We can apply our results to their setup in order to select 
the signals with significant energy (norm larger than a). 

We study here the variable selection problem in a matrix from a minimax point of
view. A selector is any measurable function of the observations, $\hat C = \hat C(\{Y_{ij}\})$, taking
values in $\mathcal{C}_{nm}$. For such a selector $\hat C = \hat C(Y)$, $Y = \{Y_{ij}\}$, we denote the maximal risk by

\[ R_{nm,a}(\hat C) = \sup_{S_{C_0} \in \mathcal{S}_{nm,a}} P_{C_0}(\hat C(Y) \neq C_0). \]

We define the minimax risk as

\[ R_{nm,a} = \inf_{\hat C} R_{nm,a}(\hat C). \]

From now on, we assume in the asymptotics that $N \to \infty$, $M \to \infty$ and $n = n_{NM} \to \infty$,
$n \ll N$, $m = m_{NM} \to \infty$, $m \ll M$. Other assumptions will be given later.
We say that a selector is consistent in the minimax sense if $R_{nm,a}(\hat C) \to 0$.

We suppose that a > 0 is unknown. The aim of this paper is to give asymptotically
sharp boundaries for the minimax selection risk. This means that, first, we are interested in
the conditions on $a = a_{nmNM}$ which guarantee the possibility of selection, i.e., the fact that
$R_{nm,a} \to 0$. We construct the selecting procedure

\[ C^\star(Y) = \arg\max_{C \in \mathcal{C}_{nm}} \sum_{(i,j) \in C} Y_{ij}. \qquad (1.4) \]

We investigate the upper bounds of the minimax selection risk of this procedure. Second,
we describe conditions on a under which selection is impossible, i.e.,
$R_{nm,a} \to 1$. These results are called the lower bounds. The two sets of conditions
are partially complementary, in the sense that violation of the upper bound conditions implies
either impossibility of selection or indistinguishability (see [7]).
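To make the scan procedure concrete, here is a minimal brute-force sketch in Python. It is only an illustration under our own assumptions, not an algorithm given in the paper: the function name `scan_selector` is hypothetical, and the search enumerates all column subsets, so it is usable only for very small matrices. It relies on the elementary fact that, once the m columns are fixed, the best n rows are those with the largest partial row sums.

```python
from itertools import combinations

def scan_selector(Y, n, m):
    """Return (rows, cols) of the n x m submatrix of Y with the largest entry sum.

    Y is a list of N rows, each a list of M numbers. Exhaustive over column
    subsets, hence exponential in M: an illustration only.
    """
    N, M = len(Y), len(Y[0])
    best_sum, best = float("-inf"), None
    for cols in combinations(range(M), m):
        # For fixed columns, the optimal rows are the n rows with largest partial sums.
        row_sums = [sum(Y[i][j] for j in cols) for i in range(N)]
        rows = sorted(range(N), key=lambda i: row_sums[i], reverse=True)[:n]
        total = sum(row_sums[i] for i in rows)
        if total > best_sum:
            best_sum, best = total, (tuple(sorted(rows)), cols)
    return best
```

For instance, on a 4 x 4 matrix with a planted 2 x 2 block of entries near 5 in rows 1, 2 and columns 0, 3, the function returns that block.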

Remark 1.1 Note that $P_{C_0}(C^\star(Y) \neq C_0)$ does not depend on $C_0 = C_0(N, M, n, m, a)$.
Therefore, for any $C_0$ we have

\[ R_{nm,a}(C^\star) = \max_{C_0 \in \mathcal{C}_{nm}} P_{C_0}(C^\star(Y) \neq C_0) = P_{C_0}(C^\star(Y) \neq C_0). \]



The problem of choosing a submatrix in a Gaussian random matrix has been previously
studied by Sun and Nobel [16]. They are interested in the largest square submatrix in Y
under the null hypothesis such that its average is larger than some fixed threshold. The
algorithm for choosing such submatrices was previously introduced in Shabalin et al. [17].

The plan of the paper is as follows. In Section 2 we state the main results of this
paper: the upper bounds for the selection procedure $C^\star$ under conditions on a, as well
as the inconsistency property of this procedure under complementary conditions on a, and,
finally, lower bounds for variable selection. We compare these results with the results
for detection in [7]. We give results for the vector case (m = M = 1) which are new
as far as the asymptotic constant is concerned. In Section 3 we prove the upper bounds
for the selection of variables, that is, conditions on a under which $R_{nm,a}(C^\star) =
\sup_{S_{C_0}} P_{C_0}(C^\star \neq C_0) \to 0$. In Section 4 we prove lower bounds for variable selection,
that is, conditions on the parameter a which imply that the minimax estimation
risk $R_{nm,a}$ tends to 1. Two techniques provide the sharp lower bounds. One method is
classical for nonparametric estimation, while the other generalizes a well-known
result to testing $L \ge 2$ hypotheses: the minimax risk is larger than the risk of the
maximum likelihood estimator.

Future extensions of this problem include several open problems. For example, consider
two-sided variable selection, i.e. finding $C_0$ where the mean satisfies $|s_{ij}| \ge a$ for $(i,j) \in C_0$.

Another possibility is to consider non-Gaussian observations with a distribution in
the exponential family. As mentioned, we may replace each observation with a smooth
signal and detect the active components (signals with significant total energy) in the
matrix.



2 Main Results 

Let 

\[ N \to \infty,\ n \to \infty,\ p = n/N \to 0; \quad M \to \infty,\ m \to \infty,\ q = m/M \to 0. \qquad (2.1) \]

We suppose that a > 0 is unknown. The aim of this paper is to give asymptotically
sharp boundaries for variable selection in a sparse high-dimensional matrix. Our approach
is to give, on the one hand, sufficient asymptotic conditions on a such that the probability
of wrongly selecting the variables in $C_0$ tends to 0 and, on the other hand, conditions
under which no consistent selection is possible.

First, we are interested in the conditions on $a = a_{nmNM}$ which guarantee consistent
variable selection: we construct the selector $C^\star$ in (1.4) and prove that
$R_{nm,a}(C^\star) \to 0$. The selector $C^\star$ scans the large N x M matrix and maximizes the
sum of the entries over all n x m submatrices.

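The key quantities appearing in the next theorems are

\[ B = B_{n,m,N,M} = \min\{A_1, A_2, A\}, \quad \text{where} \quad A = \frac{a^2 nm}{2(n\log(p^{-1}) + m\log(q^{-1}))}, \qquad (2.2) \]

\[ A_1 = \frac{a\sqrt{m}}{\sqrt{2}\,(\sqrt{\log(n)} + \sqrt{\log(N-n)})}, \qquad A_2 = \frac{a\sqrt{n}}{\sqrt{2}\,(\sqrt{\log(m)} + \sqrt{\log(M-m)})}. \]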

Let us consider the particular case where the matrix and the submatrix are square (N =
M and n = m) and, moreover, such that

\[ \log(N) \sim \log(M). \]

Then, $\log(n(N-n)) \sim \log(N/n)$ and $\log(m(M-m)) \sim \log(M/m)$, which imply that
$A_1 = A_2 > A$ and, therefore, B = A. We need the terms $A_1 = A_2$ in order to consider cases
where $\liminf \log(n)/\log(N)$ and $\liminf \log(m)/\log(M)$ are large enough and close to 1.

Another particular example is $n \sim N^P$, $m \sim M^Q$, for $P, Q \in (0,1)$, that we discuss
in more detail later on.

For this reason, we distinguish the case of severe sparsity when B = A, from the case
of moderate sparsity when $B = A_1$ or $B = A_2$.
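As a hedged numerical companion, the sketch below evaluates these quantities for given a, n, m, N, M. It assumes that A is the ratio $a^2 nm / (2(n\log(p^{-1}) + m\log(q^{-1})))$ that appears in (3.3), and that $A_1$, $A_2$ are the unsquared ratios whose squares appear there. The function name `sparsity_quantities` is our own and not part of the paper.

```python
import math

def sparsity_quantities(a, n, m, N, M):
    """Evaluate A, A1, A2 and B = min(A, A1, A2) (forms assumed from (2.2)/(3.3))."""
    p, q = n / N, m / M
    A = a ** 2 * n * m / (2 * (n * math.log(1 / p) + m * math.log(1 / q)))
    A1 = a * math.sqrt(m) / (
        math.sqrt(2) * (math.sqrt(math.log(n)) + math.sqrt(math.log(N - n))))
    A2 = a * math.sqrt(n) / (
        math.sqrt(2) * (math.sqrt(math.log(m)) + math.sqrt(math.log(M - m))))
    B = min(A, A1, A2)
    # Which term attains the minimum tells us the sparsity regime.
    regime = "severe" if B == A else "moderate"
    return {"A": A, "A1": A1, "A2": A2, "B": B, "regime": regime}
```

Since the theorems only compare these quantities with 1 in the limit, the values at finite n, m, N, M are indicative only.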

The following theorem gives sufficient conditions for the boundary $a = a_{n,m,N,M}$ such
that selection is consistent uniformly over the class $\mathcal{S}_{nm,a}$. The selector which attains
these bounds is defined by (1.4).
Theorem 2.1 Upper bounds. Assume (2.1) and assume that $B = B_{n,m,N,M}$ defined by (2.2)
is such that

\[ \liminf B_{n,m,N,M} > 1, \qquad (2.3) \]

then the selector $C^\star$ given by (1.4) is consistent, that is

\[ R_{nm,a}(C^\star) = P_{C_0}(C^\star \neq C_0) \to 0. \]

Proof is given in Section 3.

Condition (2.3) is equivalent to saying that

\[ \liminf A > 1 \ \text{ and } \ \liminf A_1 > 1 \ \text{ and } \ \liminf A_2 > 1. \]

The following proposition says that $\liminf A_1 > 1$ and $\liminf A_2 > 1$ are necessary conditions
for the consistency (in the minimax sense) of the selector $C^\star$ of $C_0$.

Proposition 2.1 Assume (2.1) and let $C^\star$ be the selector given by (1.4). If

\[ \limsup A_1 < 1 \ \text{ or } \ \limsup A_2 < 1 \]

then, for any $C_0$ such that $S_{C_0} \in \mathcal{S}_{nm,a}$,

\[ P_{C_0}(C^\star \neq C_0) \to 1. \]

Proof is given in Section 4.2.

In the following theorem we give a sufficient condition on a under which consistent
selection of $C_0$ is impossible uniformly over the set $\mathcal{S}_{nm,a}$. These are the minimax lower
bounds for variable selection.

Theorem 2.2 Assume (2.1). If, moreover, $B = B_{n,m,N,M}$ defined by (2.2) is such that

\[ \limsup B_{n,m,N,M} < 1, \qquad (2.4) \]

then there is no consistent selection of $C_0$ uniformly over $\mathcal{S}_{nm,a}$, that is

\[ \inf_{\hat C} \sup_{S_{C_0} \in \mathcal{S}_{nm,a}} P_{C_0}(\hat C(Y) \neq C_0) \to 1, \]

asymptotically, where the infimum is taken over all measurable functions $\hat C = \hat C(Y)$.

Proof of this theorem is given in Sections 4.1 and 4.2.

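Theorems 2.1 and 2.2 imply that the critical value for a is

\[ a^\star \sim \max\left\{ \frac{\sqrt{2\log(n)} + \sqrt{2\log(N-n)}}{\sqrt{m}},\ \frac{\sqrt{2\log(m)} + \sqrt{2\log(M-m)}}{\sqrt{n}},\ \frac{\sqrt{2(n\log(N/n) + m\log(M/m))}}{\sqrt{nm}} \right\}. \qquad (2.5) \]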
By critical we mean that, for a such that $\liminf a/a^\star > 1$, there is an estimator
which is uniformly consistent, while, for a such that $\limsup a/a^\star < 1$, no uniformly
consistent estimator exists.

If we consider the particular case where $n = N^P$ and $m = M^Q$ grow polynomially, for
some fixed P, Q in (0, 1), the critical value becomes

\[ (a^\star)^2 \sim \max\left\{ \frac{2(1+\sqrt{P})^2\log(N)}{m},\ \frac{2(1+\sqrt{Q})^2\log(M)}{n},\ \frac{2(1-P)\log(N)}{m} + \frac{2(1-Q)\log(M)}{n} \right\}. \]

If, moreover, n = m and N = M, we get $(a^\star)^2 \sim \max\{2(1+\sqrt{P})^2,\ 4(1-P)\}\log(N)/n$.
So, the amount of sparsity depends on whether P is larger or smaller than 1/9. In this
particular example, we have moderate sparsity, $B = A_1 = A_2 < A$, as soon as P > 1/9.
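The threshold 1/9 can be checked directly: $2(1+\sqrt{P})^2 = 4(1-P)$ reduces to $1+\sqrt{P} = 2(1-\sqrt{P})$, that is $\sqrt{P} = 1/3$. The short sketch below confirms this numerically by bisection. The helper names are ours, introduced only for illustration.

```python
import math

def moderate_term(P):
    """Constant 2(1 + sqrt(P))^2 from the moderate-sparsity terms A1 = A2."""
    return 2 * (1 + math.sqrt(P)) ** 2

def severe_term(P):
    """Constant 4(1 - P) from the severe-sparsity term A."""
    return 4 * (1 - P)

def crossover(lo=0.0, hi=1.0, tol=1e-12):
    """Bisection for the P in (0, 1) where the two constants coincide.

    moderate_term increases and severe_term decreases in P, so the
    crossing point is unique.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if moderate_term(mid) < severe_term(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Calling `crossover()` returns a value numerically equal to 1/9; for P above it the moderate-sparsity constant dominates the maximum.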

2.1 Variable selection vs. detection 

Let us compare the results in Theorem 2.1 and Theorem 2.2 with the upper bounds and
the lower bounds for detection of a set $C_0$ where our observations have significant means,
i.e. above threshold a. The testing problem for our model can be stated as

\[ H_0 : s_{ij} = 0 \ \text{ for all } (i,j), \]

and we call $P_0$ the likelihood in this case, against the alternative

\[ H_1 : \text{there exists } C_0 \in \mathcal{C}_{nm} \text{ such that } S = S_{C_0} \in \mathcal{S}_{nm,a}. \]

Recall the following theorems. 

Theorem 2.3 Upper bounds for detection, see [7]. Assume (2.1) and let a be such
that at least one of the following conditions holds:

\[ a^2 nmpq = \frac{(anm)^2}{NM} \to \infty \quad \text{or} \quad \liminf A > 1. \]

Then distinguishability is possible, i.e.

\[ \inf_{\psi} \Big[ P_0(\psi(Y) = 1) + \sup_{S_{C_0} \in \mathcal{S}_{nm,a}} P_{C_0}(\psi(Y) = 0) \Big] \to 0, \]

where the infimum is taken over all measurable functions $\psi$ taking values in {0, 1}.

It was also shown in [7] that the asymptotically optimal test procedure $\psi^\star$ combines
the scan statistic based on our $C^\star$ with a linear statistic which sums all observations
$Y = \{Y_{ij}\}_{ij}$. The test procedure $\psi^\star$ rejects the null hypothesis as soon as either the linear
or the scan test rejects.
Theorem 2.4 Lower bounds for detection, see [7]. Assume (2.1) and

\[ n\log(p^{-1}) \asymp m\log(q^{-1}), \qquad \frac{\log\log(p^{-1})}{\log(q^{-1})} \to 0, \qquad \frac{\log\log(q^{-1})}{\log(p^{-1})} \to 0. \qquad (2.6) \]

Moreover, assume that

\[ a^2 nmpq = \frac{(anm)^2}{NM} \to 0 \quad \text{and} \quad \limsup A < 1. \]

Then, consistent detection is impossible, that is

\[ \inf_{\psi} \Big( P_0(\psi(Y) = 1) + \sup_{S_{C_0} \in \mathcal{S}_{nm,a}} P_{C_0}(\psi(Y) = 0) \Big) \to 1, \]

where the infimum is taken over all measurable functions $\psi$ taking values in {0, 1}.

We deduce that there is a gap between the weakest conditions for testing that $C_0$ exists and
those for selection of the actual variables in $C_0$ (estimation of $C_0$). In Table 1 we summarize
the possible cases where consistent selection and/or consistent testing is possible or not. We
can prove that, if

\[ \limsup A < 1, \quad \liminf A_1 > 1 \ \text{ and } \ \liminf A_2 > 1, \]

then $a^2 nmpq \to 0$, hence Theorem 2.4 applies. We used this in the conditions of the second case
where neither consistent selection nor testing is possible.



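Selection \ Test | Test: Yes                                | Test: No
-----------------+------------------------------------------+------------------------------------------
Selection: Yes   | lim inf B > 1                            | -
-----------------+------------------------------------------+------------------------------------------
Selection: No    | 1) lim sup B < 1                         | Under (2.6) for the test:
                 |    and a^2 nmpq -> infinity              | 1) lim sup A < 1
                 | 2) lim inf A > 1 and                     |    and a^2 nmpq -> 0
                 |    (lim sup A_1 < 1 or lim sup A_2 < 1)  | 2) lim sup A < 1 and
                 |                                          |    lim inf A_1 > 1 and
                 |                                          |    lim inf A_2 > 1

Table 1: Conditions for variable selection and/or testing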



Let us consider the following example: $N = n^2$, $M = \log(n)$, $m = \log\log(n)$ (and,
for instance, $a^2 = \log(n)/\log\log(n)$). For all a such that $a^2 \gg \log(n)/(\log\log(n))^2$ as
$n \to \infty$, we have $a^2 nmpq = a^2(\log\log(n))^2/\log(n) \to \infty$. Therefore, on the one hand,
distinguishability holds, see Theorem 2.3, i.e. we can construct a particular test procedure
$\psi^\star$ such that

\[ P_0(\psi^\star(Y) = 1) + \sup_{S_{C_0} \in \mathcal{S}_{nm,a}} P_{C_0}(\psi^\star(Y) = 0) \to 0. \]

On the other hand,

\[ \frac{a^2 m}{2(\sqrt{\log(n)} + \sqrt{\log(N-n)})^2} = \frac{a^2\log\log(n)}{(2+\sqrt{2})^2\log(n)}\,(1+o(1)) < 1, \]

for all a such that $a^2 < (1-\delta)(2+\sqrt{2})^2\log(n)/\log\log(n)$, $\delta > 0$. By Theorem 2.2, no
consistent selection is possible in this case.
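The two computations in this example can be reproduced numerically. The sketch below is an illustration under simplifying assumptions of ours: it works with $t = \log(n)$ to avoid overflow, treats $M = \log n$ and $m = \log\log n$ as real numbers rather than integers, and uses the instance $a^2 = \log(n)/\log\log(n)$. The function name is hypothetical.

```python
import math

def example_quantities(t):
    """Return (a^2 nmpq, a^2 m / (2 (sqrt(log n) + sqrt(log(N - n)))^2)) for t = log n.

    Setting: N = n^2, M = log n, m = log log n, a^2 = log n / log log n.
    """
    m = math.log(t)            # m = log log n
    M = t                      # M = log n
    a2 = t / m                 # a^2 = log n / log log n
    # nmpq = (nm)^2 / (NM) = (log log n)^2 / log n, since N = n^2
    detection = a2 * m * m / M
    # log(N - n) = log(n (n - 1)) = 2 t + log(1 - exp(-t))
    log_N_minus_n = 2 * t + math.log1p(-math.exp(-t))
    selection = a2 * m / (2 * (math.sqrt(t) + math.sqrt(log_N_minus_n)) ** 2)
    return detection, selection
```

The first value equals $\log\log n$ and grows without bound, while the second stays near $1/(2+\sqrt{2})^2 \approx 0.086 < 1$, matching the claim that detection is possible but selection is not.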



2.2 Vector case 

Previous results can also be proven for the vector case, that is for the independent Gaussian
observations

\[ X_i = s_i + \xi_i, \quad i = 1,\dots,N, \]

where $s_i \ge a$ for all i in a set $A_0$ of n elements and $s_i = 0$ otherwise. We suppose
$n, N \to \infty$ such that $n/N \to 0$. Similarly, we can show the following result.

Theorem 2.5 Upper bounds. In the previous model, if

\[ \liminf \frac{a}{\sqrt{2\log(N)} + \sqrt{2\log(n)}} > 1, \]

then the estimator $A^\star = \arg\max_{A} \sum_{i \in A} X_i$, where the maximum is taken over sets A of n elements, is such that

\[ \sup_{A_0} P_{A_0}(A^\star \neq A_0) \to 0. \]

Lower bounds. If

\[ \limsup \frac{a}{\sqrt{2\log(N)} + \sqrt{2\log(n)}} < 1, \]

then

\[ \inf_{\hat A} \sup_{A_0} P_{A_0}(\hat A \neq A_0) \to 1. \]

The critical value is $a^\star = \sqrt{2\log N} + \sqrt{2\log(n)}$. It is equivalent to $\sqrt{2\log N}$ if
$\log(n)/\log(N) \to 0$, and $a^\star = \sqrt{2}(1+\sqrt{\beta})\sqrt{\log N}$ if $N = n^{1/\beta}$ for some $\beta \in (0,1)$.
This result follows from [13] (see Section 3.1, Remark 2 and references therein).
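In the vector case the maximizer of the sum over sets of n indices is simply the set of the n largest observations. A minimal sketch follows; the function name `vector_selector` is ours, and ties are broken by index order, which is an arbitrary choice.

```python
def vector_selector(X, n):
    """Indices of the n largest entries of X.

    This set maximizes sum_{i in A} X[i] over all index sets A of size n.
    """
    order = sorted(range(len(X)), key=lambda i: X[i], reverse=True)
    return sorted(order[:n])
```

For example, `vector_selector([0.1, 3.2, -0.5, 2.9, 0.0], 2)` selects indices 1 and 3.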

Note that in the vector case, variable selection was mostly studied for the regression 
model with deterministic design, see e.g. [3], [19] and references therein. 

Our results are sharp as they give also the asymptotic constant. 

Let us stress the fact that the particular case we study here is fundamentally different 
from the matrix setup. Indeed, an additional regime is observed according to the sparsity 
structure of the submatrix (severe or moderate) and it cannot be obtained from previous 
results for vectors by, say, vectorizing the matrix. 
3 Upper bounds



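Proof of Theorem 2.1. Note that

\[ P_{C_0}(C^\star \neq C_0) = P_{C_0}\Big( \max_{C \in \mathcal{C}_{nm}} \Big( \sum_{C} Y_{ij} - \sum_{C_0} Y_{ij} \Big) > 0 \Big). \]

We shall split the sets C according to the size of their common elements with the true
underlying $C_0$. Let $C = A \times B$ and $C_0 = A_0 \times B_0$ and let k be the number of elements in
$A \cap A_0$ and l the number of elements in $B \cap B_0$. Then, if we denote by $\mathcal{C}_{nm,kl}$ the collection
of such matrices C:

\begin{align*}
P_{C_0}(C^\star \neq C_0) &= P_{C_0}\Big( \max_{k=0,\dots,n}\ \max_{l=0,\dots,m}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C} Y_{ij} - \sum_{C_0} Y_{ij} \Big) > 0 \Big) \\
&\le P_{C_0}\Big( \max_{k=0,\dots,n}\ \max_{l=0,\dots,m}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big).
\end{align*}

From now on, we fix $0 < \delta < 1$ and separate two cases: when $kl < (1-\delta)nm$ and when
$kl \ge (1-\delta)nm$. As $\delta$ will be chosen small, it means that we treat differently the cases
where the matrix C overlaps $C_0$ but weakly (or not at all) and where the matrices overlap
almost entirely. We write and deal successively with each term in

\begin{align*}
P_{C_0}(C^\star \neq C_0)
&\le P_{C_0}\Big( \max_{k,l:\ kl < (1-\delta)nm}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big) \qquad (3.1) \\
&\quad + P_{C_0}\Big( \max_{k,l:\ kl \ge (1-\delta)nm}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big). \qquad (3.2)
\end{align*}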

3.1 Weak intersection 

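Let us fix k and l such that $kl < (1-\delta)nm$ for some $0 < \delta < 1$. Equivalently, we have
$nm - kl > \delta nm$. In this case, we shall bound the probability in (3.1) as follows:

\begin{align*}
& P_{C_0}\Big( \max_{k,l:\ kl < (1-\delta)nm}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big) \\
&\quad \le \sum_{k=0}^{n} \sum_{l=0}^{m} I_{kl < (1-\delta)nm}\, P_{C_0}\Big( \max_{C \in \mathcal{C}_{nm,kl}} \sum_{C \setminus C_0} \xi_{ij} + \max_{C \in \mathcal{C}_{nm,kl}} \sum_{C \cap C_0} \xi_{ij} - \sum_{C_0} \xi_{ij} > a(nm - kl) \Big) \\
&\quad \le \sum_{k=0}^{n} \sum_{l=0}^{m} I_{kl < (1-\delta)nm}\, (T_{1,kl} + T_{2,kl} + T_{3,kl}),
\end{align*}

where we denote by $I_{kl < (1-\delta)nm}$ the indicator function of the set where $kl < (1-\delta)nm$
and by

\begin{align*}
T_{1,kl} &= P_{C_0}\Big( \max_{C \in \mathcal{C}_{nm,kl}} \sum_{C \setminus C_0} \xi_{ij} > (1-\delta_1)\, a(nm - kl) \Big), \\
T_{2,kl} &= P_{C_0}\Big( \max_{C \in \mathcal{C}_{nm,kl}} \sum_{C \cap C_0} \xi_{ij} > \frac{\delta_1}{2}\, a(nm - kl) \Big), \\
T_{3,kl} &= P_{C_0}\Big( -\sum_{C_0} \xi_{ij} > \frac{\delta_1}{2}\, a(nm - kl) \Big),
\end{align*}

for some $0 < \delta_1 < 1$.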

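Before continuing the proof, recall that, if n, N tend to infinity such that $n/N \to 0$,
we have

\[ \log(C_{N-n}^{n-k}) \sim (n-k)\log\Big(\frac{N-n}{n-k}\Big) + (N-2n+k)\log\Big(\frac{N-n}{N-2n+k}\Big) \]

and

\[ \log(C_n^k) \le \min\Big\{ (n-k)\log\Big(\frac{ne}{n-k}\Big),\ k\log\Big(\frac{ne}{k}\Big) \Big\}, \]

for all $k = 1, \dots, n-1$, and $\log C_n^n = 0$.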
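The binomial bound $\log C_n^k \le \min\{(n-k)\log(ne/(n-k)),\ k\log(ne/k)\}$ follows from the standard inequality $C_n^k \le (ne/k)^k$ and the symmetry $C_n^k = C_n^{n-k}$. As a sanity check, here is a small sketch that compares both sides numerically; the helper names are ours.

```python
import math

def log_binom(n, k):
    """log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_binom_bound(n, k):
    """min{(n - k) log(n e / (n - k)), k log(n e / k)}, valid for 1 <= k <= n - 1."""
    return min((n - k) * math.log(n * math.e / (n - k)),
               k * math.log(n * math.e / k))

def bound_holds(n):
    """Check the inequality for every k = 1, ..., n - 1."""
    return all(log_binom(n, k) <= log_binom_bound(n, k) for k in range(1, n))
```

The check passes for every n we tried, as expected from the underlying inequality.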

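In order to give an upper bound for $T_{1,kl}$, we shall distinguish the case where $k <
(1-\delta)n$ and l = m (the case k = n and $l < (1-\delta)m$ is treated similarly) from the case
$kl < (1-\delta)nm$, k < n and l < m. On the one hand, if $k < (1-\delta)n$ and l = m, we write,
for a generic standard Gaussian random variable Z (which might change later on):

\begin{align*}
T_{1,km} &\le P_{C_0}\Big( \max_{A \in \mathcal{C}_{n,k}} \sum_{(A \setminus A_0) \times B} \xi_{ij} > (1-\delta_1)\, a(n-k)m \Big) \\
&\le C_{N-n}^{n-k}\, P\big( Z > (1-\delta_1)\, a\sqrt{(n-k)m} \big) \\
&\le \exp\Big( -\frac{(1-\delta_1)^2}{2}\, a^2 (n-k) m + \log(C_{N-n}^{n-k}) \Big),
\end{align*}

where we use repeatedly that $P(Z > u) \le \exp(-u^2/2)$, for all u > 0.
Now,

\[ \log(C_{N-n}^{n-k}) \le (n-k)\log\Big(\frac{N-n}{n-k}\Big)(1+o(1)). \]

Therefore,

\[ T_{1,km} \le \exp\Big( -(n-k)\Big( \frac{(1-\delta_1)^2}{2}\, a^2 m - \log\Big(\frac{N-n}{n-k}\Big)(1+o(1)) \Big) \Big). \]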
By assumption (2.3) we can say that

\[ \min\left\{ \frac{a^2 nm}{2(n\log(p^{-1}) + m\log(q^{-1}))},\ \frac{a^2 m}{2(\sqrt{\log(N-n)} + \sqrt{\log(n)})^2},\ \frac{a^2 n}{2(\sqrt{\log(M-m)} + \sqrt{\log(m)})^2} \right\} \ge 1 + \alpha, \qquad (3.3) \]

for some fixed small $\alpha > 0$. Therefore, if $\delta_1 > 0$ is small enough, we have some $\alpha_1 > 0$
such that

\[ \frac{(1-\delta_1)^2}{2}\, a^2 m \ge (1+\alpha_1)\log((N-n)n) \ge \log\Big(\frac{N-n}{n-k}\Big)(1+o(1)) + \log(n), \qquad (3.4) \]

asymptotically. Indeed, it is sufficient that $(1-\delta_1)^2(1+\alpha) \ge 1+\alpha_1$.
We get

\[ T_{1,km} \le \exp(-(n-k)\log(n)). \]

We conclude that

\[ \sum_{k:\ (n-k) > \delta n} T_{1,km} \le n \max_{k:\ (n-k) > \delta n} \exp(-(n-k)\log(n)) \le n^{-\delta n + 1} = o(1). \]
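On the other hand, if $kl < (1-\delta)nm$, k < n and l < m, note first that the maximum
is taken over all C in $\mathcal{C}_{nm,kl}$, but only the lines and columns outside $C_0$ actually play a
role in the sum $\sum_{C \setminus C_0} \xi_{ij}$. There are $C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l$ different values of this
sum. We write:

\begin{align*}
T_{1,kl} &\le C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l\ P\big( Z > (1-\delta_1)\, a\sqrt{nm - kl} \big) \\
&\le C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l\ \exp\Big( -\frac{(1-\delta_1)^2}{2}\, a^2 (nm - kl) \Big) \\
&\le \exp\Big( -\frac{(1-\delta_1)^2}{2}\, a^2 (nm - kl) + \log\big( C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l \big) \Big). \qquad (3.5)
\end{align*}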
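As n, m, N, M tend to infinity,

\begin{align*}
\log\big( C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l \big)
&\le \Big( (n-k)\log\frac{N-n}{n-k} + (m-l)\log\frac{M-m}{m-l} \Big)(1+o(1)) \\
&\quad + (n-k)\log\frac{ne}{n-k} + (m-l)\log\frac{me}{m-l} \\
&\le \Big( (n-k)\log\frac{N-n}{n} + (m-l)\log\frac{M-m}{m} \Big)(1+o(1)) \\
&\quad + \Big( (n-k)\log\frac{n}{n-k} + (m-l)\log\frac{m}{m-l} \Big)(1+o(1)) \\
&\quad + (n-k)\log\frac{ne}{n-k} + (m-l)\log\frac{me}{m-l}.
\end{align*}

Let us see that $(N-n)/n = (N/n)(1+o(1))$ and that

\[ (n-k)\log\frac{n}{n-k} + (n-k)\log\frac{ne}{n-k} = n\, x(1 - 2\log(x)) \le \frac{2}{\sqrt{e}}\, n, \quad \text{with } x = \frac{n-k}{n}, \]

as $x(1 - 2\log(x)) \le 2/\sqrt{e}$ for all x in [0, 1].

Let us denote $\mathcal{X} := n\log(p^{-1})$ and $\mathcal{Y} := m\log(q^{-1})$. We have

\[ \log\big( C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l \big) \le \Big( \Big(1 - \frac{k}{n}\Big)\mathcal{X} + \Big(1 - \frac{l}{m}\Big)\mathcal{Y} + \frac{2}{\sqrt{e}}(n+m) \Big)(1+o(1)). \]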



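Analogously to (3.4) we have

\[ \frac{(1-\delta_1)^2}{2}\, a^2 nm \ge (1+\alpha_1)(\mathcal{X} + \mathcal{Y}), \]

asymptotically.

Finally, we get, for large enough n, m, N, M,

\begin{align*}
& -\frac{(1-\delta_1)^2}{2}\, a^2 (nm - kl) + \log\big( C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \cdot C_n^k \cdot C_m^l \big) \\
&\quad \le -(1+\alpha_1)\Big(1 - \frac{kl}{nm}\Big)(\mathcal{X} + \mathcal{Y}) + \Big( \Big(1 - \frac{k}{n}\Big)\mathcal{X} + \Big(1 - \frac{l}{m}\Big)\mathcal{Y} + \frac{2}{\sqrt{e}}(n+m) \Big)(1+o(1)) \\
&\quad \le -\frac{\alpha_1}{2}\Big(1 - \frac{kl}{nm}\Big)(\mathcal{X} + \mathcal{Y}) + \frac{k}{n}\Big(\frac{l}{m} - 1\Big)\mathcal{X} + \frac{l}{m}\Big(\frac{k}{n} - 1\Big)\mathcal{Y} + \frac{2}{\sqrt{e}}(n+m)(1+o(1)) \\
&\quad \le -\frac{\alpha_1\delta}{2}(\mathcal{X} + \mathcal{Y}) + \frac{2}{\sqrt{e}}(n+m)(1+o(1)).
\end{align*}

Therefore, we replace this bound in (3.5) and get

\[ \sum_{k=0}^{n}\sum_{l=0}^{m} I_{kl < (1-\delta)nm}\, T_{1,kl} \le 2\exp\Big( -\frac{\alpha_1}{2}\,\delta\,(n\log(p^{-1}) + m\log(q^{-1})) + \frac{2}{\sqrt{e}}(n+m)(1+o(1)) + \log(nm) \Big) = o(1). \]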

For $T_{2,kl}$, only the common elements of C and $C_0$ play a role in the random variable
$\sum_{C \cap C_0} \xi_{ij}$, and there are $C_n^k \cdot C_m^l$ such choices. Note that we can have here neither
k = 0 nor l = 0, as $T_{2,kl} = 0$ in these cases. Therefore,

\begin{align*}
\sum_{k=0}^{n}\sum_{l=0}^{m} I_{kl < (1-\delta)nm}\, T_{2,kl}
&\le \sum_{k=1}^{n}\sum_{l=1}^{m} I_{kl < (1-\delta)nm}\, C_n^k \cdot C_m^l\ P\Big( Z > \frac{\delta_1 a (nm - kl)}{2\sqrt{kl}} \Big) \\
&\le \sum_{k=1}^{n}\sum_{l=1}^{m} C_n^k \cdot C_m^l\ P\Big( Z > \frac{\delta\,\delta_1\, a\, nm}{2\sqrt{(1-\delta)nm}} \Big) \\
&\le \sum_{k=1}^{n}\sum_{l=1}^{m} \exp\Big( -\frac{\delta^2\delta_1^2 a^2 nm}{8(1-\delta)} + k\log\Big(\frac{ne}{k}\Big) + l\log\Big(\frac{me}{l}\Big) \Big) \\
&\le \exp\Big( -\frac{\delta^2\delta_1^2 a^2 nm}{8(1-\delta)} + n + m + \log(nm) \Big) = o(1).
\end{align*}

Here, we have used the fact that $x\log(x^{-1})$ is bounded from above by $e^{-1}$ for all
$x \in [0,1]$ and used it for x = k/n and for x = l/m, respectively. Use (3.3) in order to
conclude.

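Finally, for $T_{3,kl}$, we write that $-\sum_{C_0} \xi_{ij}/\sqrt{nm}$ behaves like some standard Gaussian
random variable Z and get

\begin{align*}
\sum_{k=0}^{n}\sum_{l=0}^{m} I_{kl < (1-\delta)nm}\, T_{3,kl}
&\le \sum_{k=0}^{n}\sum_{l=0}^{m} I_{kl < (1-\delta)nm} \exp\Big( -\frac{\delta_1^2 a^2 (nm - kl)^2}{8nm} \Big) \\
&\le \exp\Big( -\frac{\delta^2\delta_1^2 a^2}{8}\, nm + \log(nm) \Big) = o(1),
\end{align*}

as $a^2 nm$ tends to infinity faster than $\log(nm)$ due to (3.3) in our setup.
In conclusion, the probability in (3.1) tends to 0:

\[ P_{C_0}\Big( \max_{k,l:\ kl < (1-\delta)nm}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big) = o(1). \qquad (3.6) \]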

3.2 Large intersection 

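Let us fix k and l such that $kl \ge (1-\delta)nm$, or, equivalently, $nm - kl \le \delta nm$. Note that it
implies both $k > (1-\delta_1)n$ and $l > (1-\delta_1)m$ for some $\delta_1$ depending on $\delta$, small as $\delta \to 0$.
The case n = k and m = l gives an event with probability 0.
We decompose as follows

\begin{align*}
\sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij}
&= \Big( \sum_{(A \setminus A_0) \times B_0} \xi_{ij} - \sum_{(A_0 \setminus A) \times B_0} \xi_{ij} \Big) \\
&\quad + \Big( \sum_{A_0 \times (B \setminus B_0)} \xi_{ij} - \sum_{A_0 \times (B_0 \setminus B)} \xi_{ij} \Big) \\
&\quad + \Big( \sum_{(A \setminus A_0) \times (B \setminus B_0)} \xi_{ij} - \sum_{(A \setminus A_0) \times (B_0 \setminus B)} \xi_{ij} \Big) + \Big( \sum_{(A_0 \setminus A) \times (B_0 \setminus B)} \xi_{ij} - \sum_{(A_0 \setminus A) \times (B \setminus B_0)} \xi_{ij} \Big) \\
&= S_1 + S_2 + S_3, \ \text{say}.
\end{align*}

We shall bound from above as follows

\begin{align*}
& P_{C_0}\Big( \max_{k,l:\ kl \ge (1-\delta)nm}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big) \\
&\quad \le P_{C_0}\Big( \max_{k > (1-\delta_1)n}\ \max_{l > (1-\delta_1)m}\ \max_{A \in \mathcal{C}_{n,k}} \Big( S_1 - (1-\delta)\, a(n-k)\frac{m+l}{2} \Big) > 0 \Big) \\
&\qquad + P_{C_0}\Big( \max_{k > (1-\delta_1)n}\ \max_{l > (1-\delta_1)m}\ \max_{B \in \mathcal{C}_{m,l}} \Big( S_2 - (1-\delta)\, a(m-l)\frac{n+k}{2} \Big) > 0 \Big) \\
&\qquad + P_{C_0}\Big( \max_{k > (1-\delta_1)n}\ \max_{l > (1-\delta_1)m}\ \max_{C \in \mathcal{C}_{nm,kl}} \big( S_3 - \delta\, a(nm - kl) \big) > 0 \Big),
\end{align*}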
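where $\mathcal{C}_{n,k}$ is the collection of sets of n rows in $1, \dots, N$ having k values in common with $A_0$, and similarly
$\mathcal{C}_{m,l}$ is the collection of sets of m columns in $1, \dots, M$ having l values in common with $B_0$. Moreover, the
previous sum can be bounded from above by

\begin{align*}
& \sum_{k > (1-\delta_1)n} P_{C_0}\Big( \max_{A \in \mathcal{C}_{n,k}} S_1 > (1-\delta)\, a(n-k)m(1 - \delta_1/2) \Big) \\
&\quad + \sum_{l > (1-\delta_1)m} P_{C_0}\Big( \max_{B \in \mathcal{C}_{m,l}} S_2 > (1-\delta)\, a(m-l)n(1 - \delta_1/2) \Big) \\
&\quad + \sum_{k > (1-\delta_1)n}\ \sum_{l > (1-\delta_1)m} P_{C_0}\Big( \max_{C \in \mathcal{C}_{nm,kl}} S_3 > \delta\, a(nm - kl) \Big) \\
&= \sum_{k > (1-\delta_1)n} U_{1,k} + \sum_{l > (1-\delta_1)m} U_{2,l} + \sum_{k > (1-\delta_1)n}\ \sum_{l > (1-\delta_1)m} U_{3,kl}, \ \text{say}.
\end{align*}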

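Let us now deal with $U_{1,k}$. Note, first, that the case k = n gives probability 0. For
$(1-\delta_1)n < k \le n-1$, we put $p_{n,N} = \sqrt{\log(N-n)} / (\sqrt{\log(N-n)} + \sqrt{\log(n)})$ and
$q_{n,N} = 1 - p_{n,N}$:

\begin{align*}
U_{1,k} &\le P_{C_0}\Big( \max_{A \in \mathcal{C}_{n,k}} \sum_{(A \setminus A_0) \times B_0} \xi_{ij} > (1-\delta)(1-\delta_1/2)\, a(n-k)m\, p_{n,N} \Big) \\
&\quad + P_{C_0}\Big( \max_{A \in \mathcal{C}_{n,k}} \Big( -\sum_{(A_0 \setminus A) \times B_0} \xi_{ij} \Big) > (1-\delta)(1-\delta_1/2)\, a(n-k)m\, q_{n,N} \Big)
\end{align*}

and, for some independent standard Gaussian r.v. $Z_1$ and $Z_2$, using $l > (1-\delta_1)m$,

\begin{align*}
U_{1,k} &\le C_{N-n}^{n-k}\, P\big( Z_1 > (1-\delta)(1-\delta_1/2)\, p_{n,N}\, a\sqrt{(n-k)m} \big) \\
&\quad + C_n^k\, P\big( Z_2 > (1-\delta)(1-\delta_1/2)\, q_{n,N}\, a\sqrt{(n-k)m} \big) \\
&\le \exp\Big( -\frac{(1-\tilde\delta)^2 a^2 m (n-k)\log(N-n)}{2(\sqrt{\log(N-n)} + \sqrt{\log(n)})^2} + \log(C_{N-n}^{n-k}) \Big) \\
&\quad + \exp\Big( -\frac{(1-\tilde\delta)^2 a^2 m (n-k)\log(n)}{2(\sqrt{\log(N-n)} + \sqrt{\log(n)})^2} + \log(C_n^k) \Big),
\end{align*}

with $1 - \tilde\delta = (1-\delta)(1-\delta_1/2)$. Note that $\log(C_{N-n}^{n-k}) \le (n-k)\log(N-n)(1+o(1))$ and
that $\log(C_n^k) \le (n-k)\log(n)(1+o(1))$. We obtain

\begin{align*}
U_{1,k} &\le \exp\Big( -(n-k)\log(N-n)\Big( \frac{(1-\tilde\delta)^2 a^2 m}{2(\sqrt{\log(N-n)} + \sqrt{\log(n)})^2} - (1+o(1)) \Big) \Big) \\
&\quad + \exp\Big( -(n-k)\log(n)\Big( \frac{(1-\tilde\delta)^2 a^2 m}{2(\sqrt{\log(N-n)} + \sqrt{\log(n)})^2} - (1+o(1)) \Big) \Big).
\end{align*}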


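We use (3.3): for small enough $\delta$,

\[ (1-\tilde\delta)^2 a^2 m \ge (1 + 2\alpha_2)\, 2(\sqrt{\log(n)} + \sqrt{\log(N-n)})^2, \]

for some $\alpha_2 > 0$, and this means

\[ \frac{(1-\tilde\delta)^2 a^2 m}{2(\sqrt{\log(N-n)} + \sqrt{\log(n)})^2} - (1+o(1)) \ge 2\alpha_2 - o(1) \ge \alpha_2. \]

Finally,

\begin{align*}
\sum_{(1-\delta_1)n < k < n} U_{1,k} &\le \sum_{(1-\delta_1)n < k < n} \Big( e^{-\alpha_2\log(N-n)(n-k)} + e^{-\alpha_2\log(n)(n-k)} \Big) \\
&\le \sum_{1 \le j \le \delta_1 n} \Big( e^{-\alpha_2\log(N-n)j} + e^{-\alpha_2\log(n)j} \Big) \\
&= \Big( e^{-\alpha_2\log(N-n)} + e^{-\alpha_2\log(n)} \Big)(1+o(1)) = o(1).
\end{align*}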



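The term $U_{2,l}$ is similar.

As for the last term, $U_{3,kl}$, we compare each sum in $S_3$ to $\delta a(nm - kl)/4$. The most
difficult (the largest) upper bound is for the first sum, as it gives the largest number of
choices $C_{N-n}^{n-k} \cdot C_{M-m}^{m-l}$. Note that this term is 0 if k = n or l = m. Therefore, we only
explain this term, for $k \le n-1$ and $l \le m-1$:

\begin{align*}
U_{31,kl} &= P_{C_0}\Big( \max \sum_{(A \setminus A_0) \times (B \setminus B_0)} \xi_{ij} > \frac{\delta}{4}\, a(nm - kl) \Big) \\
&\le \exp\Big( -\frac{(\delta/4)^2 a^2 (nm - kl)^2}{2(n-k)(m-l)} + \log\big( C_{N-n}^{n-k} \cdot C_{M-m}^{m-l} \big) \Big) \\
&\le \exp\Big( -\frac{(\delta/4)^2 a^2 \big( n(m-l)P_{k,n} + (n-k)mP_{l,m} \big)^2}{2(n-k)(m-l)} + (n-k)\log(N-n) + (m-l)\log(M-m) \Big),
\end{align*}

where $P_{k,n} = 1 - (n-k)/(2n)$ and $P_{l,m} = 1 - (m-l)/(2m)$. Recall that $n-k < \delta_1 n$ and
that $m-l < \delta_1 m$. We get

\[ U_{31,kl} \le \exp\Big( -\frac{(\delta/4)^2 a^2}{2}\big( H + 2nmP_{k,n}P_{l,m} \big) + \delta_1\big( n\log(N-n) + m\log(M-m) \big) \Big), \]

where

\[ H = n^2\,\frac{m-l}{n-k}\, P_{k,n}^2 + m^2\,\frac{n-k}{m-l}\, P_{l,m}^2 \ge \frac{1}{\delta_1}\big( nP_{k,n}^2 + mP_{l,m}^2 \big). \]

Recall that $P_{k,n} \ge 1 - \delta_1/2$ and $P_{l,m} \ge 1 - \delta_1/2$. We get, for $(\delta/4)^2 = \delta_1$:

\[ U_{31,kl} \le \exp\Big( -\frac{a^2}{2}\big( nP_{k,n}^2 + mP_{l,m}^2 \big) - \delta_1\big( a^2 nmP_{k,n}P_{l,m} - (n\log(N-n) + m\log(M-m)) \big) \Big), \]

with

\begin{align*}
a^2 nmP_{k,n}P_{l,m} &\ge (1-\delta_1/2)^2\Big( \frac{1}{2}a^2 nm + \frac{1}{2}a^2 nm \Big) \\
&\ge (1-\delta_1/2)^2(1+\alpha)\big( n\log(n(N-n)) + m\log(m(M-m)) \big),
\end{align*}

by (3.3). By taking $\delta_1$ small enough, we may find $\delta_2 > 0$ such that $(1-\delta_1/2)^2(1+\alpha) \ge 1+\delta_2$.
This is enough to conclude that

\[ a^2 nmP_{k,n}P_{l,m} - \big( n\log(N-n) + m\log(M-m) \big) \ge 0 \]

and that

\begin{align*}
U_{31,kl} &\le \exp\Big( -\frac{a^2}{2}(n+m)(1-\delta_1/2)^2 \Big) \\
&\le \exp\big( -(1-\delta_1/2)^2(1+\alpha)\big( \log(m(M-m)) + \log(n(N-n)) \big) \big) \\
&\le \exp\big( -(1+\delta_2)\big( \log(m(M-m)) + \log(n(N-n)) \big) \big).
\end{align*}

In conclusion,

\[ \sum_{(1-\delta_1)n < k < n}\ \sum_{(1-\delta_1)m < l < m} U_{31,kl} \le \exp\big( -(1+\delta_2)\log((M-m)(N-n)) - \delta_2\log(nm) \big) = o(1). \]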

Here, we have proven that

\[ P_{C_0}\Big( \max_{k,l:\ kl \ge (1-\delta)nm}\ \max_{C \in \mathcal{C}_{nm,kl}} \Big( \sum_{C \setminus C_0} \xi_{ij} - \sum_{C_0 \setminus C} \xi_{ij} - a(nm - kl) \Big) > 0 \Big) = o(1). \qquad (3.7) \]

From (3.7) and (3.6) we deduce that the probability $P_{C_0}(C^\star \neq C_0)$ tends to 0 and this
concludes the proof of the upper bounds. □

Remark 3.1 We have investigated the upper limits of the selector $C^\star$ under the assumption
that $s_{ij} = a$, $(i,j) \in C_0$. It follows that, when $s_{ij} \ge a$, $(i,j) \in C_0$, the
upper bounds stated in this section remain valid.

Indeed, the random part of the expansion $Y_C - Y_{C_0}$ is independent of $s_{ij}$. The absolute
value of the deterministic part (the difference of expectations) attains its minimum when
$s_{ij} = a$.


4 Lower bounds 

Let (2.1) and (2.4) hold. We shall call the case when B = A the case of severe sparsity, while
the cases where either $B = A_1$ or $B = A_2$ will be called moderately sparse.
Let us first consider a set $\Theta$ of matrices having size N x M and containing $S_C$, for all
$C \in \mathcal{C}_{nm}$, such that $[S_C]_{ij} = a \cdot I((i,j) \in C)$. This set is on the border of $\mathcal{S}_{nm,a}$, as we
replace $[S_C]_{ij} \ge a$ with equality, for all $(i,j) \in C$. The set has $L = C_N^n \cdot C_M^m$ elements.
Let $P_0$ denote the likelihood of N x M standard Gaussian observations and, as previously,
$P_C$ the likelihood of our observations under parameter $S_C$. The minimax risk is bounded
from below by the minimax risk over $\Theta$:

\[ \inf_{\hat C} \sup_{S_C \in \mathcal{S}_{nm,a}} P_C(\hat C(Y) \neq C) \ge \inf_{\hat C} \sup_{S_C \in \Theta} P_C(\hat C(Y) \neq C). \]
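4.1 Severe sparsity

Proof of Theorem 2.2 for the severely sparse case.

In this case, we shall apply Theorem 2.4 in [18]: if there exist $\tau > 0$ and $0 < \alpha < 1$
such that

\[ \frac{1}{L} \sum_{S_C \in \Theta} P_C\Big( \frac{dP_0}{dP_C}(Y) \ge \tau \Big) \ge 1 - \alpha, \qquad (4.1) \]

then

\[ \inf_{\hat C} \sup_{S_C \in \Theta} P_C(\hat C(Y) \neq C) \ge \frac{\tau L}{1 + \tau L}(1 - \alpha). \]

In our model, the likelihood ratio is

\[ \frac{dP_0}{dP_C}(Y) = \exp\Big( -a \sum_{(i,j) \in C} Y_{ij} + \frac{a^2 nm}{2} \Big). \]

This implies that

\begin{align*}
P_C\Big( \frac{dP_0}{dP_C}(Y) \ge \tau \Big) &= P_C\Big( -a \sum_{(i,j) \in C} Y_{ij} + \frac{a^2 nm}{2} \ge \log(\tau) \Big) \\
&= P_C\Big( -\frac{\sum_{(i,j) \in C} \xi_{ij}}{\sqrt{nm}} - \frac{a\sqrt{nm}}{2} \ge \frac{\log(\tau)}{a\sqrt{nm}} \Big) \\
&= P\Big( Z \ge \frac{\log(\tau)}{a\sqrt{nm}} + \frac{a\sqrt{nm}}{2} \Big),
\end{align*}

where Z is standard Gaussian. Let $z_{1-\alpha}$ be the quantile of probability $1-\alpha$ of a standard
Gaussian distribution, such that $P(Z \ge -z_{1-\alpha}) = 1 - \alpha$. In order to check (4.1) we need
$\log(\tau) \le -a^2 nm/2 - z_{1-\alpha}\, a\sqrt{nm}$.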

On the one hand, if $a\sqrt{nm} = O(1)$ we take $\tau$ as the solution of the equation $\log(\tau) =
-a^2 nm/2 - z_{1-\alpha}\, a\sqrt{nm}$. Therefore, we have $\tau \asymp 1$ and then

\[ \frac{\tau L}{1 + \tau L}(1 - \alpha) \ge (1 - \alpha)^2 > 0, \ \text{ as } L \to \infty. \]

On the other hand, if $a\sqrt{nm} \to \infty$, we take $\tau^{-1} = L/\log(L)$, with $L = C_N^n C_M^m$, which
gives $\tau L \to \infty$ and $\log(\tau^{-1}) \sim \log(L)$. We can prove that

\[ \log(\tau^{-1}) \ge \frac{a^2 nm}{2} + z_{1-\alpha}\, a\sqrt{nm} = \frac{a^2 nm}{2}\Big( 1 + \frac{2z_{1-\alpha}}{a\sqrt{nm}} \Big). \]

Indeed, we know that $\log(L) \sim n\log(p^{-1}) + m\log(q^{-1})$ and, by assumption (2.4),

\[ \frac{a^2 nm}{2(n\log(p^{-1}) + m\log(q^{-1}))} \le 1 - \delta, \]

asymptotically, for some $\delta > 0$. It implies that

\[ \frac{a^2 nm}{2\log(\tau^{-1})}\Big( 1 + \frac{2z_{1-\alpha}}{a\sqrt{nm}} \Big) \le 1, \]

asymptotically. This gives the lower bound

\[ \frac{\tau L}{1 + \tau L}(1 - \alpha) \ge (1 - \alpha)^2 > 0. \]

As $\alpha > 0$ can be chosen arbitrarily small, we obtain the result

\[ \inf_{\hat C} \sup_{S_C \in \Theta} P_C(\hat C(Y) \neq C) \to 1. \]

□

4.2 Moderate sparsity 

Lemma 4.1 If $\eta_1, \dots, \eta_J$ are i.i.d. random variables with standard Gaussian law, then

\[ \text{if } t < 1, \quad P\Big( \max_{j=1,\dots,J} \eta_j > t\sqrt{2\log(J)} \Big) \to 1, \ \text{ as } J \to \infty, \]

and

\[ \text{if } t > 1, \quad P\Big( \max_{j=1,\dots,J} \eta_j > t\sqrt{2\log(J)} \Big) \to 0, \ \text{ as } J \to \infty. \]

Proof. This lemma is a direct consequence of the limit behaviour of the normalized
maximum of i.i.d. Gaussian random variables:

\[ V_J := \max_{j=1,\dots,J} \eta_j \sqrt{2\log(J)} - 2\log(J) + \frac{1}{2}\log(\log(J)) + \frac{1}{2}\log(4\pi) \ \to^{d}\ U, \]

where U has the Gumbel law with distribution function $P(U \le x) = \exp(-\exp(-x))$ for
all real numbers x, see [11]. Therefore, if t < 1,

\[ P\Big( \max_{j=1,\dots,J} \eta_j > t\sqrt{2\log(J)} \Big) = P\Big( V_J > (t-1)\,2\log(J) + \frac{1}{2}\log(\log(J)) + \frac{1}{2}\log(4\pi) \Big), \]

which tends to 1 when $J \to \infty$. The other limit is obtained by a similar argument. □
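The dichotomy at t = 1 can be illustrated without simulation, since for i.i.d. standard Gaussian variables $P(\max_j \eta_j \le x) = \Phi(x)^J$ exactly. The following sketch evaluates this expression in a numerically stable way; the function name is ours, and the values of t and J below are chosen only for illustration.

```python
import math

def prob_max_exceeds(t, J):
    """Exact P(max of J i.i.d. N(0,1) variables > t * sqrt(2 log J))."""
    x = t * math.sqrt(2 * math.log(J))
    tail = 0.5 * math.erfc(x / math.sqrt(2))      # P(Z > x) for one variable
    # 1 - (1 - tail)^J, written with log1p/expm1 to keep precision for tiny tails
    return -math.expm1(J * math.log1p(-tail))
```

With $J = 10^6$, the probability is close to 1 for t = 0.8 and close to 0 for t = 1.2. The convergence in J is slow when t is near 1, in line with the logarithmic corrections in the normalization.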

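Proof of Proposition 2.1. Let us assume that $\limsup A_1 < 1$ and treat the other
case similarly. This means that $A_1 \le 1 - \alpha$, for some fixed $0 < \alpha < 1$. Equivalently,
$a\sqrt{m} \le (1-\alpha)(\sqrt{2\log(n)} + \sqrt{2\log(N-n)})$.

In this case we shall reduce the set of matrices C to those matrices having the same
columns as $C_0$ and n - 1 rows in common with $C_0$. Then we sum up each line over these
columns and reduce the problem to the vector case. Thus,

\begin{align*}
P_{C_0}(C^\star \neq C_0) &= P_{C_0}\Big( \max_{C \in \mathcal{C}_{nm}} \Big( \sum_{C} Y_{ij} - \sum_{C_0} Y_{ij} \Big) > 0 \Big) \\
&\ge P_{C_0}\Big( \max_{C = A \times B_0} \Big( \sum_{C} Y_{ij} - \sum_{C_0} Y_{ij} \Big) > 0 \Big) \\
&\ge P_{C_0}\Big( \max_{A} \Big( \sum_{i \in A} Y_{i\cdot} - \sum_{i \in A_0} Y_{i\cdot} \Big) > 0 \Big),
\end{align*}

where the maximum over A is taken over all sets of n rows having n - 1 rows in common
with $A_0$ and

\[ Y_{i\cdot} := \sum_{j \in B_0} Y_{ij} = a\,m\,I(i \in A_0) + \sum_{j \in B_0} \xi_{ij}. \]

Denote by $\eta_i = m^{-1/2} \sum_{j \in B_0} \xi_{ij}$ for $i = 1, \dots, N$, which are i.i.d. random variables of
standard Gaussian law. Therefore, we get

\begin{align*}
P_{C_0}(C^\star \neq C_0) &\ge P_{C_0}\Big( \max_{A} \Big( \sum_{i \in A \setminus A_0} \eta_i - \sum_{i \in A_0 \setminus A} (\eta_i + a\sqrt{m}) \Big) > 0 \Big) \\
&\ge P_{C_0}\Big( \max_{i \notin A_0} \eta_i + \max_{k \in A_0} (-\eta_k) > a\sqrt{m} \Big) \\
&\ge P_{C_0}\Big( \max_{i \notin A_0} \eta_i + \max_{k \in A_0} (-\eta_k) > (1-\alpha)\big( \sqrt{2\log(N-n)} + \sqrt{2\log(n)} \big) \Big) \\
&= 1 - P_{C_0}\Big( \max_{i \notin A_0} \eta_i + \max_{k \in A_0} (-\eta_k) \le (1-\alpha)\big( \sqrt{2\log(N-n)} + \sqrt{2\log(n)} \big) \Big),
\end{align*}

by the assumption on $A_1$. Moreover,

\begin{align*}
& P_{C_0}\Big( \max_{i \notin A_0} \eta_i + \max_{k \in A_0} (-\eta_k) \le (1-\alpha)\big( \sqrt{2\log(N-n)} + \sqrt{2\log(n)} \big) \Big) \\
&\quad \le P_{C_0}\Big( \max_{i \notin A_0} \eta_i \le (1-\alpha)\sqrt{2\log(N-n)} \Big) + P_{C_0}\Big( \max_{k \in A_0} (-\eta_k) \le (1-\alpha)\sqrt{2\log(n)} \Big),
\end{align*}

which tends to 0, by Lemma 4.1. □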

Proof of Theorem 2.2 for the moderately sparse case.

In this case we check that the minimax risk is bounded from below by the risk of the
maximum likelihood estimator $C^\star$ and that its risk tends to 1 under our assumptions by
Proposition 2.1. Let us see that

\begin{align*}
\inf_{\hat C} \sup_{S_C \in \Theta} P_C(\hat C(Y) \neq C) &\ge \inf_{\hat C} \frac{1}{L} \sum_{k=1}^{L} P_{C_k}(\hat C(Y) \neq C_k) \\
&= \inf_{\hat C} \Big( 1 - \frac{1}{L} \sum_{k=1}^{L} P_{C_k}(\hat C(Y) = C_k) \Big) \\
&\ge 1 - \sup_{\hat C} \frac{1}{L} \sum_{k=1}^{L} E_0\Big( I(\hat C(Y) = C_k)\, \frac{dP_{C_k}}{dP_0}(Y) \Big),
\end{align*}

where $L = C_N^n C_M^m$ is the number of elements in $\Theta$. In the previous supremum, we may
replace the arbitrary measurable function $\hat C(Y)$ by a test function $\psi(Y)$ taking values in
$1, \dots, L$. The test maximising

\[ \sup_{\psi} \frac{1}{L} \sum_{k=1}^{L} E_0\Big( I(\psi(Y) = k)\, \frac{dP_{C_k}}{dP_0}(Y) \Big) \]

will choose k such that $C_k$ has maximal likelihood: $\{ Y : \frac{dP_{C_k}}{dP_0}(Y) \ge \frac{dP_{C_j}}{dP_0}(Y), \text{ for all } j =
1, \dots, L \}$. Thus, we get the risk of a maximum likelihood estimator,

\[ \inf_{\hat C} \sup_{S_C \in \Theta} P_C(\hat C(Y) \neq C) \ge 1 - \frac{1}{L} \sum_{k=1}^{L} P_{C_k}(C^\star(Y) = C_k), \]

which tends to 1 by Proposition 2.1.

□